Mapping of mutation-sensitive sites in protein-like chains. 
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Abstract 



In this work we have studied, with the help of a simple on-lattice model, 
the distribution pattern of sites sensitive to point mutations ('hot' sites) in 
protein-like chains. It has been found that this pattern depends on the regular- 
ity of the matrix that rules the interaction between different kinds of residues. 
If the interaction matrix is dominated by the hydrophobic effect (Miyazawa 
Jernigan like matrix), this distribution is very simple - all the 'hot' sites can 
be found at the positions with maximum number of closest nearest neighbors 
(bulk). 

If random or nonlinear corrections are added to such an interaction matrix 
the distribution pattern changes. The rising of collective effects allows the 
'hot' sites to be found in places with smaller number of nearest neighbors 
(surface) while the general trend of the 'hot' sites to fall into a bulk part of a 
conformation still holds. 
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I. INTRODUCTION 



In this paper we study how the choice of a particular Hamiltonian is responsible for the 
distribution pattern of sites sensitive to point mutations in a Heteropolymeric chain. 

As shown in |J using a very simple model [Q], in every optimized sequence there 
are sites at which point mutations are likely to cause misfolding of the native state (we 
call them 'hot' sites), while there are other sites at which point mutations have no relevant 
thermodynamic effect ('cold' sites), and we call intermediate sites 'warm'. As known from 
both experimental and theoretical studies [l] usually proteins are made of few 'hot' sites 
while the majority of the other sites are 'cold' . Because of the thermodynamic importance 
of the 'hot' sites it is of general interest to investigate the principles guiding their positioning 
along the protein chain. 

The model used in this work describes the polymer as a chain of beads in a cubic lattice, 
interacting through the Hamiltonian 

H=W B iA(ri-rj), (1) 

where B^ is the contact energy between the two residues situated on the i-th and j-th 
positions, L is the length of the chain and A(rj — rj) has the value 1 if the i-th and j-th 
are nearest neighbors and zero if not. 

In literature, different choices of the matrix B^ have been used. The results of Ref. |].[2|] 
have been found with a matrix || whose elements are distributed according to a Gaussian, 
with average B = and standard deviation a = 0.3 (in units of kT = 0.6 kcal/mol). With 
these matrix elements, for every target structure it is possible to select sequences whose 
native state is unique, stable and kinetically accessible ||. It has been found that a reliable 
condition for a site to be 'cold' or 'hot' is intimately connected to the change in the native 
state energy caused by a point mutation: the bigger is this difference, the more probable for 
this site to be a 'hot' site. Hot sites of such sequences are mainly in the bulk sites of the 
native conformation, but can be on the surface as well, while some bulk sites can be rather 
insensitive to mutations (Fig.||a). 
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Repeating the same calculations with a random-generated, Gaussian-distributed set of 
interaction energies, we have observed similar distribution patterns for 'hot' and 'cold' sites 
(Fig.0b). 

Another choice of the interaction matrix can be made to take into account explicitly the 
hydrophobic effect encountered in real proteins. The simplest way is to choose the matrix 
Bij to be composed of only three different elements, namely Bhh, Bpp and Bhp = Bph, 
where Bhh < Bpp < Bhp- These elements are responsible for the interaction between 
hydrophobic (H) and polar (P) residues. Unfortunately it was found || that for a given 
target structure and 'HP' interaction matrix it is difficult to construct optimal sequences 
for which this target structure would be kinetically accessible. As a rule, the optimization 
of the sequence puts H-residues in the bulk sites (the sites with the greatest number of 
nearest-neighbors), and every substitution of an H residue with a P residue causes the chain 
misfolding (See Figs. 2 and 3 in Ref. 0). 

On the way between the random matrix and the highly regular HP interaction model 
stands the matrix deduces by Miyazawa and Jernigan in |]]. This matrix contains 210 differ- 
ent elements which can still be grouped into 3 big blocks, according to their hydrophobicity. 
In this case it is possible to find sequences for which the native state is both stabile and 
kinetically accessible. It was also found that, as in the situation with only two kinds of 
residues, 'hot' sites are invariably the sites with the highest number of contacts (bulk sites, 
Fig. 0d). 

The question again is what are the principles guiding the positioning of 'hot' sites along 
the protein chain and how a particular form of the interaction matrix can influence the 
distribution pattern of 'hot' and 'cold' mutation sites. This paper is organized as following. 
In the next section we will present a convenient representation of an interaction matrix as a 
function of 'mixing' parameter (3. Variation of the 'mixing' parameter will correspond to a 
change from highly ordered HP like interaction matrix at (3 — to highly nonlinear matrix 
at non-zero values of 'mixing' parameter. A distribution pattern of 'cold' and 'hot' sites will 
be investigated with the use of these interaction matrices. Next we will address the same 
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question by introducing a parameter of 'randomness' 7 which would allow us to investigate 
the mutation sites distribution pattern with highly ordered HP like interaction matrix at 
7 = and random Gaussian interaction matrix at large 7. Conclusions will be drawn at the 
end. 



II. LTW PARAMETERIZATION 

In the work by Lee, Tang and Wingreen || a particular interesting parameterization of 
semi-experimental MJ matrix |§ was introduced, as a consequence of its regularity. In their 
work it was shown that elements of the MJ interaction matrix can be very nicely fitted as 

B ai = q a + q-y + /3q a q 1 , (2) 

where q a is ascribed to a monomer of kind a and (3 is a constant. This parameterization 
involves 20 parameters, instead of 210 parameters of the MJ matrix, each specifying the 
strength of a particular residue. Using this parameterization we have found that, for the 
best fit, all q a are negative and range from —2.3 to 0, while (3 = —0.42. Furthermore, it has 
been pointed out that the origin of the additive term is due to the hydrophobic effect, while 
the second order term is responsible for the segregation of dissimilar residues. 

Using this parameterization, we can write the Hamiltonian of the chain in a form [ 10f 



H = qn+ T^qCq, (3) 

where n and q have dimensions equal to the number of monomers in the chain. The %- 
th coordinate of n is the number of nearest neighbors for the i-th monomer and the i-th 
coordinate of q specifies the strength of the residue in the z— th site. C is the contact matrix 
for a given conformation. In the above reference it has been shown that the Hamiltonian (|3]) 
fits very well the original MJ Hamiltonian and, what is more precious, it is very convenient 
to handle analytically. 

One of the problems that can be solved easily using the above Hamiltonian is, given a 
target structure (C matrix and n) and the composition of the chain in term of residues, to 



find the sequence which minimize the energy JTO| . Particularly, using the fact that the second 
order term is usually 2 — 3 times smaller than the first order term, a first order approximation 
solution can be found minimizing only the solvent exclusion term. The straightforward way 
is to choose a sequence so that the vectors q and n are as anti-parallel as possible, keeping the 
constraint of a fixed number of different kinds of monomers. Knowing that all components 
of q are negative while all components of n are positive, it is necessary to put the residues 
with the most negative value of g» in the sites of the target structure with the largest number 
of nearest-neighbors. The effect is, roughly speaking, to put 'hydrophobic' residues (i.e. low 
qi) inside the structure while keeping 'hydrophilic' ones (high g,) on the surface. The second 
order term in the Hamiltonian ([3]) is responsible for a fine tuning of residue distribution, 
mostly inside the hydrophobic/hydrophilic regions, and causes a further decrease of the 
sequence energy. 



III. MUTATIONS AND HOT SITES 

According to ]IJ, we label each mutation with the difference of the native state energy 
of the wild-type sequence and of the mutated sequence 

a^ oc = ^E(4-^') a ^-^)' ( 4 ) 

where Bfj is the interaction element associated with the wild-type sequence and Bfj with 
a mutated sequence. It has been shown that the information about the thermodynamic 
features of the mutated sequence is mostly contained in the value of AEi oc §, Jllfl . 

The energetic effects of mutations in a given site i are studied introducing the average 
AEi oc over the 19 possible mutations in this site, 

1 19 

19 « 3 

For optimized sequences there are few sites (up to 10%) where AEi oc is large (~ A, where 
A is the gap between the native state and the random conformations energy ||), while for 
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the others AEi oc is much lower. We call the former sites 'hot' and the latter 'cold'. If a 
mutation occurs in a 'hot' site, there is a high probability that it fills the gap, eliminating 
the feature of design and causing misfolding of the chain. 

For random interaction matrices whose elements have Gaussian distribution, the behavior 
of hot sites is the same as indicated in Ref. [|1|]. Hot sites can be both in the bulk and on the 
surface of the native configuration (See Fig. [l|b) with dominance of 'hot' sites in the bulk. 

On the other hand, in a model with only two kinds of residues (H and P), the optimization 
of the sequence puts H residues in the sites with the highest number of nearest neighbors. 
In this case (see Ref. [0]), every site containing a H residue is a hot site. 

We are now interested in studying the pattern of hot sites in a model where the interaction 
matrix contains both features of hydrophobic-hydrophilic separation and randomness or 
complicated non-linearity. 

IV. CONSTRUCTION OF THE MAP OF 'HOT' AND 'COLD' SITES 

To investigate the behavior of AE loc (i) we use the parameterized Hamiltonian @. It is 
straightforward to show that, for a wild-type sequence characterized by q 

L 

&E loc (i) = -{rii + /3^C ij g i )(g— < q >«), (6) 

j 

where < q > a = 1/20 Y^=i <la, is the average of the values q a corresponding to the 20 
monomers different from the wild-type. With this assumption we have found < q > a ~ 
-1.23. 

We shell consider first the case (3 = 0, which is exactly solvable, and then the conse- 
quences of the non-linear term introduced via non-zero 'mixing' parameter (3.. 

A. (3 = 

In the case (3 = the Hamiltonian contains only the solvent exclusion term Hq = J2i Qi^i- 
As we discussed earlier, optimization of the sequence, given a structure, consists in putting 
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the most hydrophobic residues into the sites with the greatest number of contacts, keeping 
with the constraint of fixed number of different residues, so that 

* - iHh- ( 7 ) 

For short chains and 20 letter code used in the model with (3 = the optimization 
procedure described above works reasonably good in constructing optimal sequences while 
for longer chains it seems that the model is not adequate to ensure a single ground state or 
a ground state well-separated from the other states. Nevertheless, the case (3 = is a good 
starting point to investigate the role of the different terms of Hamiltonian @. 

In this case equation (||) can be written as 



AEi oc (i) = -ni(qi- < q > a ). (8) 

Defining < q >= ^2 i=i q%j X^=i n i an d substituting (0) in (§) as an approximation for an 
optimal sequence, we obtain 

AE^ c (i) = -n t <q> (m - <q>a ). (9) 

< q > 

For typical values of < q >~ —0.6 and < q > a ~ —1.26 then 



&E loc {i) w OSmirii - 2.1). (10) 



The shape of this function is plotted in Fig. 0. It is interesting to compare the value AEi oc (i) 
for the optimized sequence with the one for a random sequence. We will identify, in the 
spirit of Random Energy Model ||12|| , 'hot' sites (as defined in Sect. Ill) with those sites in 



which the average impact of mutations is greater than for a random sequence. We define a 
sequence to be random if there is no correlation between the strength qi at a given site, and 
its number of nearest neighbors rij, so that 



AEf™ d (i) = -m{q- < q >„), (11) 



where q ranges between —2.3 and 0. The values that /S.El^ d (i) can assume are comprised 
between the two straight lines plotted in Fig.||]a. While the dependence of AEi oc (i) for the 



random sequence is linear, for a selected sequence it is quadratic. In the case of the selected 
sequence the quadratic behavior of the mutation energy versus the number of closest nearest 
neighbors induces a sharp distinction between bulk sites (n > 3) and surface sites. As clear 
form (Fig. |2|a), all bulk sites are 'hot', while surface sites are 'cold'. As consequence, 'hot' 
sites have a certain degree of symmetry in target structures, i.e. no one of the sites with 
the same number of nearest neighbors is privileged to the others. The map of the 'hot' and 
'cold' sites for a Hamiltonian (3 — and a 36 monomer target structure is presented on (Fig. 

BO- 



B. (3 ^ 

If (3 is not zero, but still small in absolute values, it is possible to minimize the energy 
of the sequence in two steps, first minimizing Hp =Q = nq and finding an initial trial optimal 
sequence q tr , and then re-minimizing the same Hp = $ with an effective n e ff = n + ^q tr C. As 
was shown in Ref. |TI| this procedure is quite reliable for small f3. 

Using the same approximation (0) as in the previous section for q tr one finds 

A^(<) « -n, < q > (1 + (3 < q > ^)(n, - ^2.) (12) 

where iVj = J2j Cijiij is the number of nearest neighbors of the nearest neighbors of site i 
and can assume values in the range {rij, 4rij + 1}. The expression of AE^ oc (i), then, takes 
into account, through the value of iVj, sites lying further than the nearest neighbors of i. 
As consequence of this, we can observe (Fig. p|b) a broadening of the range of values that 
AEi oc (i) can assume for each rij (we will refer to this broadening as 'energy bands' in our 
further discussion) 

From Fig. (^|b) it follows that the effect of the segregation term in (3) is to differentiate 
among the sites with the same number of closest nearest neighbors. As (3 ^ 0, the second 
shell of nearest neighbors starts playing its role, thus introducing a cooperative effects in the 
determination of AE^ c (z). This splitting of the degeneracies in the AE? oc (i) at (3 can 
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lead to the overlap of 'energy bands' thus leading to the possibility of encountering a 'hot' 
site on a surface and 'cold' site in the bulk. 

To summarize, from the point of view of single mutations, in the case (3 = the spectrum 
of mutations is composed by two main parts, namely mutations in bulk sites, with high 
value of AEi oc , and mutations in surface sites, with AE[ oc close to zero or negative. The 
effect of the coupling term is to broaden the range of the possible mutation energies for the 
mutation sites of the same type (bulk or surface) mixing the energetic levels corresponding 
to mutations in different kind of sites. So, the symmetry of hot sites in the target structure 
can be broken thus allowing some hot sites to be found on surface. 



C. Explicit Calculations with MJ parameterized Hamiltonian 

To investigate how the 'energy bands' depend upon the strength of the nonlinear con- 
tribution in the interaction matrix we have made some explicit calculations, using as target 
structure the 36mers chain displayed on (Fig. [I]). We have first calculated the best values 
of q a and (3 to fit MJ matrix according to Eq. |2|, finding (3 = —0.42. Using these values for 
q a , we varied (3 in the range (—1.5, 1.5), optimizing each time the sequence with a genetic 
energy minimization technique. The composition has been kept fixed for all values of (3 and 
chosen in such a way as to satisfy qi ~ n.j (condition for optimal composition at /3 — 0). For 
the sake of computational convenience, the interaction matrix elements have been rescaled 
to have zero average and standard deviation equal to 1.0. Then, we have plotted the value 
of AEi oc (i) for each lattice site, as function of (3. First, we consider the case @ < 0. The 
raising of different 'energy bands' is shown in Fig.|3]a. For —0.7 < (3 < all the bands lay 
into four distinct groups. The first two groups, which correspond to the sites with n, = 1 or 
Hi = 2, contribute to cold sites. The other two groups correspond to the sites with = 3 
or rii = 4 defining the 'warm' and 'hot' sites. This situation is very similar to the case of 
(3 = 0, where bands do not overlap and 'hot' sites are exclusively in the bulk. It is not 
surprising to find the same distribution pattern for the MJ matrix (/3 = —0.42) (Fig.[T]d) . 
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and HP-like interaction matrix (/3 = 0) (Figjljc). For —2.5 ^ (3 % — 1.0 the nonlinear part 
of the interaction matrix starts playing its role. As was shown in the previous section the 
second neighbor shell contribute to the value of AE^ oc (i). At these values of parameter (3 
we observe that some of the energy bands corresponding to 'warm' (n=3) and 'cold' sites 
(n=l,2) mix, while the bands corresponding to the 'hot' sites stay well separated from the 
other bands. 

In the case of (3 > (Fig.[3|b) the non-linear effect is much more dramatic. For (3 ^ 0.2 
all bands start mixing allowing 'cold' sites penetrate the bulk while pushing 'hot' sites on a 
surface. We can rationalise the different pattern of 'energy bands' at (3 < and f3 > by 
examining the Eq.|] and Hamiltonian |3|. Noticing that ^qCq is of the order of &qn < q > 
we can rewrite the Hamiltonian [3] in a form 

H = qn{l + ^<q>)+(3r 1 (13) 

where rj contains the collective effects and qn(l + §<<?>) describes a 'renormalized' 
hydrophobic effect. In the case /3<0,(|<g>)>0 and the interaction matrix is largely 
dominated by the hydrophobic effect, while the collective contribution to the Hamiltonian 
is not strong enough to substantially mix the 'energy bands'. 

For the j3 > case (f < q >) < and the renormalized hydrophobic effect becomes 
comparable or smaller than the collective term in the Hamiltonian. This allows all bands to 
mix substantially. 

D. Explicit Calculations for Random Hamiltonian 

Another interesting question which can be addressed is how our conclusions are modified 
by the addition of a random term in the (3 = Hamiltonian. To study this problem, we 
have chosen a Hamiltonian in the form 

1 L 

H = nq + j~J^e ij A(r i -r j ), (14) 
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where nq comes from the parameterization (0) of the MJ matrix with (3 = 0. The virtue 
of this Hamiltonian is in separation in a controllable manner the hydrophobic effect due to 
the nq and any other non-linear effects are modeled by the random term. The values of 
are taken from a Gaussian distribution with mean zero and standard deviation 1.0. In (Fig. 
^) the 'energy bands' are shown as a function of 'mixing' parameter 7. The overall pattern 
is exactly the same as discussed in the previous section for the case in which non-linear 
terms in the parameterization are switched on leading to the effect of 'band crossing' thus 
allowing 'warm' sites to appear on the surface and 'cold sites' in the bulk. Even in the case 
of 'band crossing', there is still a clear trend for the 'hot' and 'warm' sites to concentrate 
in the bulk of the structure. This is due to the fact that bulk sites, building the biggest 
number of contacts, still display the strongest response to point mutations. 

In the inserts of Figs.^|a,b and Fig.|] shown the energy of the optimal non-mutated se- 
quence for different values of (3 and 7, respectively. It is possible to observe a sudden 
decrease in energy of the optimal sequence as (3 and 7 increases signifying that the addition 
of non-linear terms into a Hamiltonian allows for a much better energy minimization. 

V. CONCLUSIONS 

In this work we have considered how the regularity of the interaction matrix influence 
the distribution pattern of 'hot' sites. It has been found that if the matrix is polarized 
(dominant hydrophobic effect, (3 ~ or 7 ~ 0) this distribution is very simple. All the 
'hot' sites can be found at the positions with maximum number of closest nearest neighbors 
(bulk). 

With increasing importance of non- linear terms (f3 7^ 0,7 7^ 0) the distribution pattern 
changes so that the 'hot' and 'warm' sites can be found in places with smaller number of 
nearest neighbors (surface) while the general trend of the 'hot' sites to fall into a bulk part 
of a conformation still holds. 

As pointed out above this can be rationalized by noticing that if mixing parameter is 
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different than zero each site starts feeling not only its nearest neighbors but also the more 
distant sites. This leads to a collective nature of the interactions giving rise to a modified 
distribution pattern of 'hot' sites. 
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FIGURES 

FIG. 1. a) Map of the 'hot' (black), 'warm' (dashed) and 'cold' (white) sites for the random 
Gaussian matrix, b) Map of the mutation sites for the randomly generated Gaussian matrix, c) 
Map of the mutation sites for the LTW parameterization of MJ matrix with (3 = 0. d) Map of the 
mutation sites for the MJ interaction matrix (HP like model). 

FIG. 2. a) Energy of mutation as a function of the number of closest nearest neighbors for 
LTW parameterized interaction matrix with (3 = 0. AEi oc for the optimized sequence exhibits 
nonlinear behavior leading to the sharp differentiation of 'hot' and 'cold' sites. While AEi oc for 
the random sequence can have both positive and negative values at any n leading to the possibility 
of finding a 'hot' mutation even at n = 1. b) Energy of mutation as a function of the number 
of closest nearest neighbors for LTW parameterized interaction matrix with (3 < 0. As second 
nearest neighbors start contributing to the mutation energy the energy line broadens thus allowing 
for the sites with equal number of closest nearest neighbors to have different interaction energies. 
All possible AEi oc are confined between the two parabolas shown on a graph. 

FIG. 3. Energy bands for 36 mutation sites. The interaction energy matrix is based on LTW 
parameterization with (3 parameter introducing a non-linear segregation energy in Hamiltonian. 
Different line types correspond to the sites with different number of closest nearest neighbors. So, 
for example, solid lines correspond to the mutation energy of the sites 16 and 27 that have the 
most number of closest nearest neighbors. Sites with 4,3,2 and 1 closest nearest neighbors are 
specified by solid, dashed, solid-dashed and dotted lines respectively, a) corresponds to [3 < b) 
corresponds to (3 > 0. In the insert the energy of the optimal non-mutated sequence is shown for 
different values of (3. 



14 



FIG. 4. Energy bands for 36 mutation sites. The interaction energy matrix is a mix of LTW 
parameterized matrix with (3 = and Gaussian random matrix. The mixing with random matrix 
is controlled by the parameter 7. At 7 = the interaction matrix is pure LTW with (3 = while at 
7 ~ 2.0 the elements of random matrix become comparable to the elements of the regular matrix. 
Different line types correspond to the sites with different number of closest nearest neighbors. So, 
for example, solid lines correspond to the mutation energy of the sites 16 and 27 that have the most 
number of closest nearest neighbors. Sites with 4,3,2 and 1 closest nearest neighbors are specified 
by solid, dashed, solid-dashed and dotted lines respectively. In the insert the energy of the optimal 
non- mutated sequence is shown for different values of 7. 
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Energy of mutation for (3=0 LTW parametrization 
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